Abstract

Genome-wide association analysis has generated much discussion about how to preserve power
to detect signals despite the detrimental effect of multiple testing on power. We develop a weighted
multiple testing procedure that facilitates the input of prior information in the form of groupings
of tests. For each group a weight is estimated from the observed test statistics within the group.
Differentially weighting groups improves the power to detect signals in likely groupings. The
advantage of the grouped-weighting concept, over fixed weights based on prior information, is that
it often leads to an increase in power even if many of the groupings are not correlated with the 
signal. Being data dependent, the procedure is remarkably robust to poor choices in groupings. 
Power is typically improved if one (or more) of the groups clusters multiple tests with signals, 
yet little power is lost when the groupings are totally random. If there is no apparent signal in a 
group, relative to a group that appears to have several tests with signals, the former group will be 
down-weighted relative to the latter. If no groups show apparent signals, then the weights will be 
approximately equal. The only restriction on the procedure is that the number of groups be small, 
relative to the total number of tests performed. 

Key Words: Bonferroni correction, Genome-wide association analysis, Multiple testing, Weighted
p-values.

Thorough testing for association between genetic variation and a complex disease typically
requires scanning large numbers of genetic polymorphisms. In a multiple testing situation, such 
as a whole genome association scan, the null hypothesis is rejected for any test that achieves a 
p-value less than a predetermined threshold. To account for the greater risk of false positives, this 
threshold is more stringent as the number of tests conducted increases. To bolster power, recent 
statistical methods suggest up-weighting and down-weighting of hypotheses, based on prior like- 
lihood of association with the phenotype (Genovese et al. 2006, Roeder et al. 2006). Weighted 
procedures multiply the threshold by the weight w, for each test, raising the threshold when w > 1 
and lowering it if w < 1. To control the overall rate of false positives, a budget must be imposed 
on the weighting scheme. Large weights must be balanced with small weights, so that the aver- 
age weight is one. These investigations reveal that if the weights are informative, the procedure 
improves power considerably, but, if the weights are uninformative, the loss in power is usually 
small. Surprisingly, aside from this budget requirement, any set of non-negative weights is valid 
(Genovese et al. 2006). While desirable in some respects, this flexibility makes it difficult to select 
weights for a particular analysis. 

The type of prior information readily available to investigators is often non-specific. For
instance, SNPs might naturally be grouped, based on features that make various candidates more
promising for the disease under investigation. For a brain-disorder phenotype we might
cross-classify SNPs by categorical variables such as those displayed in Table I. The SNPs in $Q_1$ seem
most promising, a priori, while those in $Q_4$ seem least promising. Those in $Q_2$ and $Q_3$ are more
promising than those in $Q_4$, but somewhat ambiguous. It is easy to imagine additional variables that
further partition the SNPs into various classes that help to separate the more promising SNPs from
the others. While this type of information lends itself to grouping SNPs, it does not lead directly
to weights for the groups. Indeed it might not even be possible to choose a natural ordering of
the groups. What is needed is a way to use the data to determine the weights, once the groups are
formed.

                      Functional   Non-Functional
Brain expressed       $Q_1$        $Q_2$
Non-Brain expressed   $Q_3$        $Q_4$

Table I.

Until recently, methods for weighted multiple-testing required that prior weights be developed 
independently of the data under investigation (Genovese et al. 2006, Roeder et al. 2006). In this 
article we ask the following questions: if the weights are to be applied to tests grouped by prior 
information, what choice of weights will optimize the average power of the genetic association 
study? How can we estimate these weights from the data to achieve greater power without affecting 
control of the family-wise error rate? 

Methods 

Consider $m$ hypotheses corresponding to standardized test statistics $T = (T_1, \ldots, T_m)$. The
p-values associated with the tests are $(P_1, \ldots, P_m)$. We assume $T_j$ is approximately normally
distributed with non-centrality parameter $\xi_j$, or the tests are $\chi^2$ distributed with
non-centrality parameter $\xi_j^2$. When using a Bonferroni correction for $m$ tests, the threshold
for rejection is achieved if the p-value $P_j < \alpha/m$. The weighted Bonferroni procedure of
Genovese, Roeder and Wasserman (2005) is as follows. Specify nonnegative weights
$w = (w_1, \ldots, w_m)$ and reject hypothesis $H_j$ if

$$j \in R = \left\{ j : \frac{P_j}{w_j} < \frac{\alpha}{m} \right\}. \qquad (1)$$

As long as $m^{-1} \sum_j w_j = 1$, this procedure controls family-wise error rate at level $\alpha$. For a test of
$\xi_j = 0$ vs. $\xi_j \neq 0$, the power of a single weighted test is

$$\pi(\xi_j, w_j) = \bar\Phi\left( \bar\Phi^{-1}\left( \frac{\alpha w_j}{2m} \right) - \xi_j \right) + \bar\Phi\left( \bar\Phi^{-1}\left( \frac{\alpha w_j}{2m} \right) + \xi_j \right),$$

where $\bar\Phi$ is the upper tail probability of a standard normal cumulative distribution function.
When the alternative hypothesis is true, weighting increases the power when $w_j > 1$ and decreases
the power when $w_j < 1$. We call $\pi(\xi_j, w_j)$ the per-hypothesis power. For signals $(\xi_1, \ldots, \xi_m)$ and
weights $(w_1, \ldots, w_m)$ the average power is

$$\frac{1}{m_1} \sum_{j=1}^{m} \pi(\xi_j, w_j)\, I(\xi_j \neq 0),$$

where $m_1$ is the number of tests with a signal.
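To make rule (1) and the per-hypothesis power concrete, the following Python sketch implements both with the standard library only. The function names and the toy inputs are illustrative assumptions and are not taken from the original analysis.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def upper_tail(x):
    """Upper tail probability of the standard normal, P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def upper_tail_inv(p):
    """Inverse of upper_tail: the x with P(Z > x) = p, for 0 < p < 1."""
    return -_STD_NORMAL.inv_cdf(p)


def weighted_bonferroni(p_values, weights, alpha=0.05):
    """Indices j rejected by rule (1): P_j / w_j < alpha / m."""
    m = len(p_values)
    if abs(sum(weights) / m - 1.0) > 1e-8:
        raise ValueError("weights must average to one")
    return [j for j, (p, w) in enumerate(zip(p_values, weights))
            if w > 0 and p / w < alpha / m]


def per_hypothesis_power(xi, w, m, alpha=0.05):
    """Power of one two-sided weighted test; assumes alpha * w / (2 * m) < 1."""
    if w <= 0:
        return 0.0
    z = upper_tail_inv(alpha * w / (2.0 * m))
    return upper_tail(z - xi) + upper_tail(z + xi)


# Toy example: the first two p-values are equal, but only the up-weighted one is rejected.
rejected = weighted_bonferroni([0.02, 0.02, 0.5, 0.5], [2.0, 1.0, 0.5, 0.5])
```

With weights (2, 1, 0.5, 0.5) the first hypothesis is compared against twice the Bonferroni threshold, so it is rejected while the second, with the same p-value and unit weight, is not. Under the null ($\xi_j = 0$, $w_j = 1$) the power function reduces to the per-test level $\alpha/m$.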

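The optimal weight vector $w = (w_1, \ldots, w_m)$ that maximizes the average power subject to
$w_j \ge 0$ and $m^{-1} \sum_{j=1}^{m} w_j = 1$ is (Wasserman and Roeder 2006)

$$w(\xi_j) = \frac{m}{\alpha}\, \bar\Phi\left( \frac{|\xi_j|}{2} + \frac{c}{|\xi_j|} \right), \qquad (2)$$

where $c$ is the constant that satisfies the budget criterion on weights

$$\frac{1}{m} \sum_{j=1}^{m} w(\xi_j) = 1. \qquad (3)$$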

The optimal weights vary with the signal strength in a non-monotonic manner (Figure 1). For
any particular sample, c adjusts the weights to satisfy the budget constraint on weights. In so doing, 
it shifts the mode of the weight function from left to right depending on the number of small, versus 
large, signals observed. 
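The role of the constant $c$ can be illustrated numerically. The sketch below assumes the weight function (2) has the form $(m/\alpha)\,\bar\Phi(|\xi_j|/2 + c/|\xi_j|)$, with weight zero when $\xi_j = 0$, and finds $c$ by bisection so that the weights average to one. The function names, the bracketing strategy, and the toy signal vector are assumptions made for illustration.

```python
import math


def upper_tail(x):
    """Upper tail probability of the standard normal, P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def weight(xi, c, m, alpha):
    """Assumed form of the weight function (2); zero when there is no signal."""
    if xi == 0:
        return 0.0
    a = abs(xi)
    return (m / alpha) * upper_tail(a / 2.0 + c / a)


def optimal_weights(signals, alpha=0.05):
    """Return (weights, c) with c chosen so that the weights average to one."""
    m = len(signals)
    if not any(signals):
        raise ValueError("at least one nonzero signal is required")

    def mean_weight(c):
        return sum(weight(xi, c, m, alpha) for xi in signals) / m

    # The mean weight decreases as c grows, so widen a bracket and bisect.
    lo, hi = -1.0, 1.0
    while mean_weight(lo) < 1.0:
        lo *= 2.0
    while mean_weight(hi) > 1.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_weight(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    c = 0.5 * (lo + hi)
    return [weight(xi, c, m, alpha) for xi in signals], c


# Toy example: 90 nulls and two tests at each of five signal strengths.
signals = [0.0] * 90 + [1.0, 2.0, 3.0, 4.0, 5.0] * 2
weights, c = optimal_weights(signals)
```

In this toy example the tests with midrange signals receive larger weights than those with the weakest or strongest signals, which is the non-monotonic pattern described above.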

The optimal weight function has an interesting effect on the rejection threshold. This choice 
of weights results in a threshold for rejection that varies smoothly with the signal strength. Figure
2 plots the rejection threshold $-\log_{10}(\alpha w_j/m)$, calculated for the data displayed in Figure 1,
as a function of the signal strength and contrasts it with the rejection threshold of a Bonferroni
corrected test, $-\log_{10}(\alpha/m)$. From Figures 1 and 2 it is evident why an optimally weighted test has
greater power than a non-weighted test. The weighted threshold is less stringent for signals in the
midrange, and more stringent for both large and small signals. Consequently, if the signal is likely 
to be very strong or very weak, the test is down-weighted (weight less than one). In practice, little 
power is lost by this tradeoff. For small signals the chance of rejecting the hypothesis is minimal 
with or without weights. For large signals the p-value is likely to cross the threshold regardless of 
the weight. Larger weights are focused in the midrange to help to reveal signals that are marginal. 
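As a small numerical illustration of the threshold on the $-\log_{10}$ scale, the sketch below evaluates $-\log_{10}(\alpha w/m)$ for a given weight. The values $m = 100{,}000$ and $\alpha = 0.05$ follow the Figure 2 caption; the function name and the example weights are assumptions.

```python
import math


def rejection_threshold(w, m, alpha=0.05):
    """Threshold on the -log10 scale; a test is rejected if -log10(p) exceeds it."""
    return -math.log10(alpha * w / m)


m = 100_000
bonferroni = rejection_threshold(1.0, m)     # unweighted test, w = 1
up_weighted = rejection_threshold(10.0, m)   # weight above one: lower threshold
down_weighted = rejection_threshold(0.1, m)  # weight below one: higher threshold
```

Each tenfold change in the weight shifts the threshold by one unit on the $-\log_{10}$ scale, so a midrange signal with a large weight needs a less extreme p-value to be declared significant.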



Clearly $\xi_j$ is not known, so it must be estimated to utilize this weight function. A natural choice
is to build on the two stage experimental design (Satagopan and Elston 2003; Wang et al. 2006)
and split the data into subsets, using one subset to estimate $\xi_j$, and hence $w_j$, and the second to
conduct a weighted test of the hypothesis (Rubin et al. 2006). This approach would arise naturally
in an association test conducted in stages. It does lead to a gain in power relative to unweighted 
testing of stage 2 data; however, it is not better than simply using the full data set without weights 
for the analysis (Rubin et al. 2006; Wasserman and Roeder 2006). These results are corroborated 
by Skol et al. (2005) in a related context. They showed that it is better to use stages 1 and 2 jointly, 
rather than using stage 2 as an independent replication of stage 1. 

To gain a strong advantage with data-based weights, prior information is needed. One option 
is to order the tests (Rubin et al. 2006), but with a large number of tests this can be challenging. 
Another option is to group tests that are likely to have a signal, based on prior knowledge, as 
follows: 

1. Partition the tests into subsets $G_1, \ldots, G_K$, with the $k$th group containing $r_k$ elements,
ensuring that $r_k$ is at least 10-20.

2. Calculate the sample mean $Y_k$ and variance $S_k^2$ for the test statistics in each group.

3. Label the $i$th test in group $k$, $T_{ik}$. At best only a fraction of the elements in each group will
have a signal, hence we assume that for $i = 1, \ldots, r_k$ the distribution of the test statistics is
approximated by a mixture model

$$T_{ik} \sim (1 - \pi_k) N(0, 1) + \pi_k N(\xi_k, 1)$$

or

$$T_{ik} \sim (1 - \pi_k) \chi^2_1(0) + \pi_k \chi^2_1(\xi_k^2),$$

where $\xi_k$ is the signal size for those tests with a signal in the $k$th group. (This is an
approximation because the signal is likely to vary across tests.)

4. Estimate $(\pi_k, \xi_k)$ using the method of moments estimator. For the normal model this is

$$\hat\pi_k = Y_k^2 / (Y_k^2 + S_k^2 - 1), \qquad \hat\xi_k = Y_k / \hat\pi_k,$$

provided $\hat\pi_k > 1/r_k$; otherwise $\hat\xi_k = 0$.

For the $\chi^2$ model $\hat\xi_k^2$ is a root of the quadratic equation $x^2 - bx + 1 = 0$ where
$b = (S_k^2 - 1)/(Y_k - 1) + Y_k - 5$. If both roots are negative, $\hat\xi_k^2 = 0$; otherwise,
$\hat\pi_k = (Y_k - 1)/\hat\xi_k^2$.

5. For each of the groups, construct weights $w(\hat\xi_k)$. Then, to account for excessive variability
in the weights, induced by variability in $\hat\xi_k$, smooth the weights by taking a weighted average of
each group's weight and the mean weight across groups:

$$w_k = 0.95\, w(\hat\xi_k) + 0.05\, K^{-1} \sum_{k'=1}^{K} w(\hat\xi_{k'}).$$

Renormalize the weights if necessary to ensure the weights sum to $m$. Each test in group $k$ receives
the weight $w_k$.
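The five steps above can be sketched in Python for the normal model. This is a minimal illustration under stated assumptions, not the authors' implementation: the weight function is taken to be $(m/\alpha)\,\bar\Phi(|\xi|/2 + c/|\xi|)$, the constant $c$ is found by bisection so that the group weights satisfy the budget, a non-positive moment denominator is treated as no signal, and equal weights are returned when no group shows a signal. All function names and the toy data are hypothetical.

```python
import math
from statistics import fmean, variance


def upper_tail(x):
    """Upper tail probability of the standard normal, P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def estimate_group(stats):
    """Method-of-moments estimate (pi_hat, xi_hat) for the normal mixture (step 4)."""
    r = len(stats)
    y, s2 = fmean(stats), variance(stats)
    denom = y * y + s2 - 1.0
    if denom <= 0.0:  # assumption: a non-positive denominator means no signal
        return 0.0, 0.0
    pi_hat = y * y / denom
    if pi_hat <= 1.0 / r:  # step 4: no apparent signal in this group
        return 0.0, 0.0
    return pi_hat, y / pi_hat


def raw_weight(xi, c, m, alpha):
    """Assumed form of the weight function (2); zero when xi is zero."""
    if xi == 0.0:
        return 0.0
    a = abs(xi)
    return (m / alpha) * upper_tail(a / 2.0 + c / a)


def grouped_weights(groups, alpha=0.05, smooth=0.05):
    """One weight per group, scaled so the weights over all m tests sum to m."""
    sizes = [len(g) for g in groups]
    m = sum(sizes)
    xi_hat = [estimate_group(g)[1] for g in groups]
    if not any(xi_hat):  # assumption: equal weights when no group shows a signal
        return [1.0] * len(groups)

    def budget(c):
        return sum(r * raw_weight(x, c, m, alpha) for r, x in zip(sizes, xi_hat)) / m

    lo, hi = -1.0, 1.0  # bracket c; budget(c) decreases as c grows
    while budget(lo) < 1.0:
        lo *= 2.0
    while budget(hi) > 1.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if budget(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    c = 0.5 * (lo + hi)

    w = [raw_weight(x, c, m, alpha) for x in xi_hat]
    mean_w = fmean(w)
    w = [(1.0 - smooth) * wk + smooth * mean_w for wk in w]  # step 5 smoothing
    total = sum(r * wk for r, wk in zip(sizes, w))
    return [wk * m / total for wk in w]  # renormalize so the weights sum to m


# Toy data: three groups with mean zero and one group in which half the tests are shifted.
null_group = [1.0, -1.0] * 100
signal_group = [1.0, -1.0] * 10 + [2.0, 4.0] * 10
weights = grouped_weights([null_group, null_group, null_group, signal_group])
```

In this toy input the three groups without an apparent signal share a small common weight, which is nonzero only because of the smoothing step, while the group with a clustered signal is strongly up-weighted.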

This weighting scheme relies on data-based estimators of the optimal weights, but with a parti- 
tion of the data sufficiently crude to preserve the control of family-wise error rate. The approach is 
an example of the "sieve principle". More formally this result is stated in the following Theorem. 

Theorem. Let $b_m = \frac{1}{m} \sum_k \sqrt{r_k}$. If $\sum_{j=1}^{m} \hat{w}_j = m$, then procedure (1) controls family-wise error at level
$\alpha + O(b_m)$. Proof is in the Appendix.

This result establishes control of family-wise error at level $\alpha$, asymptotically, provided
$b_m \to 0$ as $m \to \infty$.

The inflation term in the error rate is near zero under a number of circumstances. Loosely
speaking, the requirement is that each group contains a sufficient number of elements to permit valid
estimation of $\xi_k$. For instance, if each group has the same number of elements $r_k = r$, then
$b_m = 1/\sqrt{r}$, which goes to zero, provided the number of groups grows more slowly than the
number of tests performed. Likewise, $b_m \to 0$ if $\max_k\{\sqrt{r_k}\} / \min_k\{r_k\} \to 0$.
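The size of the inflation term is easy to check numerically. The sketch below reads $b_m$ as $m^{-1} \sum_k \sqrt{r_k}$, which is the reading consistent with $b_m = 1/\sqrt{r}$ for equal group sizes; the function name and the example partitions are assumptions.

```python
import math


def inflation_term(group_sizes):
    """b_m = (1/m) * sum_k sqrt(r_k), where m is the total number of tests."""
    m = sum(group_sizes)
    return sum(math.sqrt(r) for r in group_sizes) / m


few_large_groups = inflation_term([1000] * 10)   # 10 groups of 1,000 tests
many_small_groups = inflation_term([10] * 1000)  # 1,000 groups of 10 tests
```

For the same total of 10,000 tests, ten groups of 1,000 give an inflation term of about 0.03, whereas 1,000 groups of 10 give about 0.32, which is why the number of groups should be small relative to the number of tests.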

Figure 3 illustrates how $w(\hat\xi_k)$ varies with $\hat\xi_k$ and the sample variances (weight is proportional
to the diameter of the circle). Notice that weight increases as a function of the signal until it 
becomes fairly large and then declines. 

Results 

To simulate a large scale study of association, we generate test statistics from $m = 10{,}000$ tests
with $m_1 = 50$ or $100$ tests having a signal ($\xi_j > 0$) and $m_0 = m - m_1$ following the null
hypothesis. These choices were made to simulate the second stage of a two-stage genome-wide 
association study, with about 1/3-1% of the initial SNPs tested at stage 2. In the proximity of a 
causal SNP, clusters of tests tend to exhibit a signal. We simulate the data as if 5-10 additional 
SNPs were in the proximity of each causal SNP. Thus, if 10-20 actual causal variants are present 
in the genome, approximately 50 to 100 tests might be associated with the phenotype at varying 
levels of intensity. 

The simulated signal strengths vary over 5 levels $(\xi_1, \ldots, \xi_5) = \xi_0 \times (1, 1.5, 2, 2.5, 3)$ with $m_1/5$
realizations of each of the 5 levels of signals. The $m$ simulated tests are grouped into categories
$G_1, \ldots, G_K$ with the groupings formed to convey various levels of informativeness. Let $\xi_{ik}$ be the
signal of the $i$th element in group $k$, $\bar\xi_k$ be the mean in group $k$, and $\bar\xi$ be the mean of the whole
set, respectively. The information in a prior grouping is summarized by the $R^2$,

$$R^2 = \frac{\sum_k r_k (\bar\xi_k - \bar\xi)^2}{\sum_k \sum_i (\xi_{ik} - \bar\xi)^2}.$$
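A short sketch may help fix ideas about how $R^2$ measures the informativeness of a grouping. It assumes the usual definition, the between-group sum of squares of the signals divided by their total sum of squares; the function name and the toy groupings are illustrative assumptions.

```python
from statistics import fmean


def grouping_r2(groups):
    """Share of the variability in signal explained by the grouping.

    `groups` is a list of lists holding the true signal of each test.
    """
    all_signals = [x for g in groups for x in g]
    grand_mean = fmean(all_signals)
    total = sum((x - grand_mean) ** 2 for x in all_signals)
    between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    return between / total


ideal = grouping_r2([[0.0, 0.0], [2.0, 2.0]])          # signals fully separated
uninformative = grouping_r2([[0.0, 2.0], [0.0, 2.0]])  # signals spread evenly
```

A grouping that separates signals from nulls perfectly attains an $R^2$ of one, and a grouping in which every group has the same mean signal attains zero.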

The 10,000 tests are grouped into 10 categories. We start the process by dividing the $m_0$ tests
that do not have a signal randomly into 5 equal sized groupings, $G_1, \ldots, G_5$. Now $m_1$ tests remain
to constitute the remaining 5 categories, $G_6, \ldots, G_{10}$. We create the ideal partition of these tests by
placing all tests with a common value of $\xi$ in the same category. Next, to create more realistic
groupings, we move some tests from categories 1-5 into 6-10 and vice versa. Specifically, we move
a fraction $p_0$ of the $m_0$ null tests to categories 6-10, and distribute them evenly. Likewise we move
a fraction $p_1$ of the $m_1$ tests with $\xi > 0$ to categories 1-5, and distribute them evenly. By varying
$(p_0, p_1)$ we obtain various levels of informativeness of the groupings, reflecting priors of various
value.
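The composition of the ten categories for a given $(p_0, p_1)$ follows from simple bookkeeping, sketched below. The function name and return layout are assumptions, and the counts are per-category averages under the even distribution described above.

```python
def group_composition(m=10_000, m1=100, p0=0.1, p1=0.0):
    """Per-category (nulls, alternatives) counts for categories 1-5 and 6-10."""
    m0 = m - m1
    nulls_moved = p0 * m0  # nulls moved into categories 6-10
    alts_moved = p1 * m1   # alternatives moved into categories 1-5
    low = ((m0 - nulls_moved) / 5, alts_moved / 5)    # each of categories 1-5
    high = (nulls_moved / 5, (m1 - alts_moved) / 5)   # each of categories 6-10
    return low, high


low, high = group_composition(p0=0.5)
size_of_promising_category = sum(high)
```

With $m_1 = 100$ and $p_1 = 0$, setting $p_0 = 0.5$ gives 990 nulls and 20 alternatives in each of categories 6-10, and $p_0 = 0.1$ gives 198 nulls and 20 alternatives.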

To see the effect of including null loci in the same grouping as the SNPs with true effects, we
fix ($\xi_0 = 2$, $p_1 = 0$, $m_1 = 100$) and vary $p_0$. Setting $p_0 = 0.5$ (0.1) increases the number of elements
in each of groups 6-10 to 1,010 (218), but only 20 are true alternatives. For $p_0 = 0.01$, 0.1, 0.25, and 0.5 we
find a difference in power (weighted minus the unweighted procedure) of 14, 5, 0, and -3 percent,
respectively. So, for $p_0 > 0.25$ there is a loss in power, but it is relatively small.

Next we explore the effect of failing to place the true effects in the more promising categories 
(6-10). To do so, we fix ($\xi_0 = 2$, $p_0 = 0.1$, $m_1 = 100$) and vary $p_1$. For $p_1 = 0.05$, 0.1, 0.5, and
0.9, we find a difference in power of 7, 3, 2, -5 and -2 percent, respectively. Even when 90% of
the true alternatives are grouped with large numbers of nulls in groups 1-5, the loss in power is 
relatively small. Another interesting feature is that a 50% swap leads to a greater loss in power than 
a 90% swap. The latter occurs because weights are approximately constant across groups when 
the alternatives are scattered nearly at random. When half of the alternatives are in the promising 
groups, these categories are up- weighted at the expense of the other categories. This balance can 
lead to a net loss in power, relative to the unweighted test. 

Figure 4 displays the difference in power as a function of $R^2$. The proportion of null tests in
cells 1-5, and alternative tests in cells 6-10 varies: $p_0 \in [0.01, 0.5]$ and $p_1 \in [0.01, 0.95]$. From
these simulations we see that, provided $p_0 < 0.5$ and $p_1 \le 0.1$, the weighted method is generally
more powerful than the unweighted method (plot symbol "o"). Two exceptions occur; in both, $R^2$
accounts for less than 2% of the variability in signal. For $R^2$ near 0 the loss in power from poorly selected
groupings is modest. Deviations in $p_1$ from ideal have a greater impact than deviations of $p_0$
(plot symbol "★" vs. "+"). This asymmetry is expected because groups 1-5 contain many more
elements than groups 6-10. Consequently signals can be swamped by nulls in these groupings.
Finally we tried mixing the various levels of true alternatives $\xi_0 \times (1, 1.5, 2, 2.5, 3)$ among groups
6-10 and found that this had a negligible effect on the power (results not shown).

Discussion 

Whole genome analysis has generated much discussion about power, the effect of multiple testing 
on power, and various multistage experimental designs (e.g., Wang et al. 2006). We investigate the 
performance of a weighting scheme that allows for the input of weak prior information, in the form 
of groupings of tests, to improve power in large scale investigations of association. The method can 
be applied at any stage of an experiment. The beauty of the grouped-weighting concept is that it 
is likely to lead to an increase in power, provided multiple tests with signals are clustered together 
in one (or more) of the groups. Little power is lost when many groups contain no true signal. This 
remarkable robustness is achieved because the procedure uses the observed test statistics in the 
grouping to determine the weight. If there is no apparent signal, the group will be down-weighted. 
The only restriction on the procedure is that the number of groups be small, relative to the total 
number of tests performed. 

Using groupings and weights to interpret the many tests conducted in a large scale association 
study has potential, regardless of power lost when weights are poorly chosen. Typically some SNPs 
are favored due to knowledge gleaned from the literature and prior investigations. When seemingly 
random SNPs produce smaller p-values than the favored candidates, one is baffled about how to 
handle the situation. Moreover, it often happens that promising candidate SNPs do produce small 
p-values, but these p-values might not be small enough to cross the significance threshold when 
a Bonferroni correction is applied. After the huge investment of a whole genome scan it would 
be foolhardy not to pursue both (i) SNPs that produce tiny p-values and (ii) SNPs that produce 
respectable p-values that would have been significant had a formal weighting scheme been utilized
to incorporate prior information. We suggest using the weighting method of analysis described
here as a way to formalize the incorporation of prior information. 

Weights can be incorporated into various multiple testing procedures, including false discovery 
methods. This paper considers controlling family-wise error rate, but similar results hold for false 
discovery control (Benjamini and Hochberg 1995) and will be pursued elsewhere. 

Appendix 

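Proof of Theorem 1. Let $H_0$ denote the set of indices for which $\xi_j = 0$. With fixed weights, the
family-wise error is

$$P(|R \cap H_0| > 0) = P\left( P_j < \frac{\alpha w_j}{m} \text{ for some } j \in H_0 \right)
\le \sum_{j \in H_0} P\left( P_j < \frac{\alpha w_j}{m} \right) = \frac{\alpha}{m} \sum_{j \in H_0} w_j \le \alpha \bar{w} = \alpha,$$

where $\bar{w} = m^{-1} \sum_j w_j = 1$.

The signal in the group occupied by the $j$th test, $\hat\xi_k$, is estimated from a sample of $r_k$ test
statistics, consequently $\hat\xi_k = \xi_k + O(r_k^{-1/2})$. Thus with random weights

$$P(|R \cap H_0| > 0) \le \sum_{j \in H_0} P\left( P_j < \frac{\alpha \hat{w}_j}{m} \right) \le \alpha (1 + O(b_m)).$$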
Figure 2: Threshold for rejecting P-values versus signal strength. The $-\log_{10}$ p-value is rejected if it
is larger than the threshold. For this illustration $m = 100{,}000$ and $\alpha = 0.05$. The unweighted
Bonferroni has a constant threshold value (horizontal line). The weighted threshold varies as a function
of the weight (curved line). The optimal weight is calculated as a function of the (estimated) signal
strength.

Figure 3: Weight as a function of $\hat\xi_k$ and variance. The diameter of the circle indicates relative
weight.

Figure 4: Net power difference between weighted Bonferroni and unweighted, as a function of $R^2$.
The worst cases are $p_0 = 0.5$ (plot symbol +) and $p_1 > 0.1$ (plot symbol *). The remaining models
have plot symbol o.